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Abstract 

Diverse classes of proteins function through large-scale conformational changes and 
various sophisticated computational algorithms have been proposed to enhance 
sampling of these macromolecular transition paths. Because such paths are curves in a 
high-dimensional space, it has been difficult to quantitatively compare multiple paths, a 
necessary prerequisite to, for instance, assess the quality of different algorithms. We 
introduce a method named Path Similarity Analysis (PSA) that enables us to quantify 
the similarity between two arbitrary paths and extract the atomic-scale determinants 
responsible for their differences. PSA utilizes the full information available in 
3A-dimensional configuration space trajectories by employing the Hausdorff or Frechet 
metrics (adopted from computational geometry) to quantify the degree of similarity 
between piecewise-linear curves. It thus completely avoids relying on projections into 
low dimensional spaces, as used in traditional approaches. To elucidate the principles of 
PSA, we quantified the effect of path roughness induced by thermal fluctuations using a 
toy model system. Using, as an example, the closed-to-open transitions of the enzyme 
adenylate kinase (AdK) in its substrate-free form, we compared a range of protein 
transition path-generating algorithms. Molecular dynamics-based dynamic importance 
sampling (DIMS) MD and targeted MD (TMD) and the purely geometric FRODA 
(Framework Rigidity Optimized Dynamics Algorithm) were tested along with seven 
other methods publicly available on servers, including several based on the popular 
elastic network model (ENM). PSA with clustering revealed that paths produced by a 
given method are more similar to each other than to those from another method and, 
for instance, that the ENM-based methods produced relatively similar paths. PSA 
applied to ensembles of DIMS MD and FRODA trajectories of the conformational 
transition of diphtheria toxin, a particularly challenging example, showed that the 
geometry-based FRODA occasionally sampled the pathway space of force field-based 
DIMS MD. For the AdK transition, the new concept of a Hausdorff-pair map enabled 
us to extract the molecular structural determinants responsible for differences in 
pathways, namely a set of conserved salt bridges whose charge-charge interactions are 
fully modelled in DIMS MD but not in FRODA. PSA has the potential to enhance our 
understanding of transition path sampling methods, validate them, and to provide a 
new approach to analyzing conformational transitions. 
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Author Summary 

Many proteins are nanomachines that perform mechanical or chemical work by changing 
their three-dimensional shape and cycle between multiple conformational states. 
Computer simulations of such conformational transitions provide mechanistic insights 
into protein function but such simulations have been challenging. In particular, it is not 
clear how to quantitatively compare current simulation methods or to assess their 
accuracy. To that end, we present a general and flexible computational framework for 
quantifying transition paths—by measuring mutual geometric similarity—that, 
compared with existing approaches, requires minimal a-priori assumptions and can take 
advantage of full atomic detail alongside heuristic information derived from intuition. 
Using our Path Similarity Analysis (PSA) framework in parallel with several existing 
quantitative approaches, we examine transitions generated for a toy model of a 
transition and two biological systems, the enzyme adenylate kinase and diphtheria toxin. 
Our results show that PSA enables the quantitative comparison of different path 
sampling methods and aids the identification of potentially important atomistic motions 
by exploiting geometric information in transition paths. The method has the potential 
to enhance our understanding of transition path sampling methods, validate them, and 
to provide a new approach to analyzing macromolecular conformational transitions. 


Introduction 


Protein function is intimately linked with the mechanistic nature of conformational 
transitions—a central problem in computational biophysics is to determine the function 
of a protein given its 3D structure Bi Proteins such as enzymes, molecular motors 
and membrane transporters behave much like nano-molecular machines that perform 
mechanical or chemical work by undergoing conformational transitions between two or 
more metastable states. Large scale conformational changes comprise the slowest 
frequency motions of a macromolecule and can take place on the millisecond time scale 
and beyond. Equilibrium molecular dynamics (MD) is arguably one of the most robust 
approaches to simulating macromolecular dynamics, in large part due to the availability 
of full atomistic detail . It has has been the workhorse tool for studying the protein 
structure-function connection [^. However, conformational transitions are rare events: 
crossing events in the transition region of phase space take place much faster than the 
(waiting) time scales of metastable equilibria, often by several orders of magnitude. 
Equilibrium simulations thus disproportionately sample metastable states instead of 
transition events—the so-called sampling problem—greatly limiting their ability to 
generate conformational transition paths |^. 

Enhanced path-sampling methods and other computational approaches have been 
developed to mitigate the sampling problem inherent to macromolecular transition 
events, permitting the observation of physics on time- and length-scales inaccessible to 
equilibrium MD (see [^12| for reviews). Conformational sampling in trajectory-based 


(i.e., dynamics-based) methods 13 -^ is accelerated by reducing the computational 


cost per time step and/or by minimizing the total number of time steps needed for 
sampling [^. Non-dynamical approaches can be roughly divided into the class of 
minimum (free) energy path (MEP/MFEP) methods [22}j^ , including elastic network 
model (ENM) approaches and prior-information/geometry-based 

algorithms [36}|3^ . A large number of the aforementioned methods overlap 
algorithmically or are similar in spirit; many are also directly amenable (or can be 
adapted) to performing free energy calculations. Presently, however, the full extent to 
which such coarse-grained (CG) or biased MD approaches can replicate physical 
transition ensembles is unknown, especially given the diversity of physical assumptions 


PLOS 


2/40 







•@'PLOS I SUBMISSION 


of the various models. Thus, tools aiding more rigorous inspection of the capabilities 
and effectiveness of path-sampling methods are needed. In a more general sense we need 
a means to compare the protein motions, i.e. the transition paths, in an unbiased 
manner that makes use of all the available structural information. 


Approaches to transition path analysis 

Conformational transition paths are represented by sequences of (snapshots of) 
conformers in 3fV-dimensional configuration space, making it difficult to examine—both 
visually and quantitatively—their character without resorting to dimensionality 
reduction in a collective variable (CV) space. Native contacts analysis (NCA), for 
example, is a general approach frequently used to characterize protein folding 
pathways [40| and enables dimensionality reduction via a projection onto 2D native 
contacts (NC) space. NCA has the property that structural contacts are defined 
without reference to another structure, making NC space projections particularly useful 
when good reaction coordinates are not known a priori. Another common approach is 
principal component analysis (PCA), a tool that can be used to visualize 
conformational dynamics in a lower-dimensional subspace spanned by several principle 
components (PCs) [S,42 . An important aspect of PCA is that motion along PCs can 


be viewed in real space, helping make complicated dynamical motions visually tractable. 

Using NCA, PCA or other CV approaches cannot, however, guarantee that 
important dynamical motions will be captured in the projections—whether (and what) 
dynamical information is lost depends on the projection itself. It is clear that a 
quantitative method that can examine a full 3V-dimensional trajectory would help 
mitigate biases inherent to selecting a coordinate projection. We propose a general 
computational method named Path Similarity Analysis (PSA) to quantitatively 
compare 3V-dimensional macromolecular transition paths, which is based on the idea of 
measuring the geometric similarity between pairs of paths using path similarity metrics. 
Based on distances between paths, trajectories are then clustered by similarity. The 
structural determinants responsible for the difference between any two trajectories are 
extracted at the atomic level by exploiting properties of the underlying metric. Here we 
introduce the PSA approach, examine its suitability, performance, and limitations as a 
computational approach to quantifying path similarity and apply it to a toy system and 
conformational transitions of two proteins. 


Path metrics for measuring transition path similarity 

Path similarity analysis (PSA) exploits the properties of a (path) metric function, 5, 
that measures a distance between a pair of piecewise-linear or polygonal curves, i.e., an 
ordered set of vertices connected by edges. A metric <5 applied to curves A, B, C has 
the properties 


S{A,B) > 0 

(la) 

(5(A,B) =0 A = B 

(lb) 

S{A,B)=SiB,A) 

(Ic) 

S{A,C) < S{A,B) + 6{B,C). 

(Id) 


In particular, Eq. |lb| the identity property, is essential since it implies that, given two 
curves A and H, if H were to be continuously deformed so as to monotonically decrease 
the distance 5{A, B), then (5(A, B) —>• 0 as 5 —>• A. That is, two curves must become 
identical as their mutual distance approaches zero so that decreasing values of <5 


correspond to increasing similarity. The other properties —non-negativity (Eq. la), 
commutativity (Eq. [l^ and triangle inequality (Eq. Id)—guarantee that S behaves in 


PLOS 


3/40 









•@'PLOS I SUBMISSION 


the same way as any other metric usually used in structural comparisons (such as root 
mean squared distance) even though it compares whole paths and not just individual 
conformations. 

PSA does not require the use of true metrics and can be used with any path distance 
function or other dissimilarity measure where only Eqs. |la[|I^ are satisfied. The triangle 
inequality (Eq. |ld[ ), which is a generalization of the transitive property, says that when 
two objects, A and B, in some metric space, are each close to a third object, C, in the 
same space, then A is close to B in the sense that the triangle inequality, 
d{A, B) < d{A, C) + d{B^ C), provides an upper bound on their distance apart. The 
triangle inequality is therefore important when comparing more than two objects, which 
is the common scenario when analyzing many conformational transitions. Although in 
the following we only consider true metrics, we also explore several distance functions 
that violate the triangle inequality in |Sl Tel^ In the main part of this study, we 
consider two candidates for 5 —the Hausdorff metric 43 - 45 and the discrete Frechet 
metric 46 47 —and illuminate situations where one might be selected in favor of the 


other. Given two paths as input, both metrics locate two points, one per path, 
corresponding to some notion of a maximal deviation between the paths. An important 
property of these metrics is that they are sensitive only to path geometry; they are 
insensitive to dynamical motions and associated physical time scales along paths. We 
provide a brief overview of these two path metrics in the context of conformational 
transitions. 


Hausdorff metric. We start with a 3A^-dimensional configuration space containing 
two paths P and Q represented, respectively, as sequences of conformations 
{iPk)k=i I Pk G = 1,... ,n} and {iqk)T=i I ft G = 1,... ,m}. The 

Hausdorff distance is defined as 

dH{P,Q) = max{6hiP \ Q),Sh{Q \ P)} , (2) 


where 


5h{P I Q) = maxmind(p,q) 

peP qeQ 


( 3 ) 


is the directed Hausdorff distance from P to Q, and d is a distance metric on 
(measuring point distances) [^; the vertical bar {P \ Q) emphasizes that Sh{P \ Q) is 
not commutative. The function Sh{P \ Q) selects the point p* € P, among all points in 
P, with the most distant nearest neighbor q* G Q (as measured by d{p*, q*))- In the 
language of conformational transitions, we interpret d{p, q) as a putative structural 
similarity measure between conformers p and g, so that for some conformer pk G P, its 
structural “nearest neighbor” in Q is given by min^gg d(pk,q). Thus, Sh{P \ Q) is the 
distance d associated with the conformer in P having the most distant or least similar 
nearest neighbor (in Q). The Hausdorff distance between P and Q, 6h{P,Q), is 
therefore the distance associated with the point —of all points in P and Q —with the 
least similar nearest neighbor, and implies that all points have a nearest neighbor that 
is at most Sh{P,Q) away. 


Prechet metric. Unlike the Hausdorff metric, Frechet metrics are sensitive to the 
orientation (i.e., directionality) of paths; real transition paths are inherently directional 
which in principle makes Frechet metrics superior to the Hausdorff metric. Informally, 
the continuous Frechet distance can be visualized by considering a man walking on a 
path P and his dog on another path Q 47 . Both start at the initial points of their 


respective paths, and they are imagined to be connected by an elastic leash that 
remains taught so as to measure the distance separating them at all times. We then 
allow the man and dog to move independently on their respective paths under the 
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condition that each progresses in a monotonic fashion (i.e., no backward steps) from 
start to finish. The Frechet distance between P and Q is then defined as the length of 
the shortest leash necessary for the man and dog to move along their respective paths 
from beginning to end according to the aforementioned constraints. Formally, for two 
continuous curves P : [ao,ai] —?> oq < ai and Q : [foo,6i] —>■ , bo < bi that are 

parameterized with a real parameter, the continuous Frechet distance corresponds to 
finding two specific continuous and monotonous parameterizations a : [0,1] —)■ [ao,ai] 
and /3 : [0,1] —>■ [6o, 6i] (the “schedules” of the man and the dog along their paths) so 
that the largest point distance d for a given set of parameterizations is minimized [47| , 

6f{P,Q) = min max d(p{a{t)),Q{l3{t))'). (4) 

«,/3 tG[0.1] V ' ' / 


Algorithms exist to solve this difficult problem in O {nm log nm) time for polygonal 
curves (where n and m are the number of vertices in each curve) [47| and various faster 


approximate solutions have been suggested 48 49 


In this paper, however, we exclusively use the discrete Frechet distance, SdFi with 

as it is simpler and faster to compute (in O (nm) time) 


the algorithm outlined by 50 


than its continuous counterpart, 5f- The formal definition of 5dF considers two 
polygonal curves P and Q that are defined respectively by n and m ordered points in a 
metric space (V, d) for some metric d. Let the corresponding sequence of endpoints of 
the line segments of P and Q be respectively defined as a{P) = (pi, - ■ ■ ,Pn) and 
a(Q) = (qi,..., qm)- In the product space (j(Q, P) = (j(P) x a{Q), we define a coupling 
between two polygonal curves P and Q as a sequence. 


G(P,Q) = (Pai,qbi),{Pa2,qb2),- ■ ■ ,{PaL,qbL), (5) 

of L unique pairs of points (i.e., number of links) satisfying the following conditions; (1) 

The first/last pairs correspond to the first/last points of the respective paths 

(oi = bi = 1, gl = n and b^ = to); (2) at least one point on a path (for a pair of points, 

one per path) must be advanced to its successive point, i.e., (a^+i = and 

biFi = bi + 1) or (oi+i = 0 ^ + 1 and bi+i = bi) or (a^+i = 0 ^ + 1 and bi+i =5^ + 1) for 

all z = 1,..., L. The largest distance between a pair of points (pai,qbi) for a given 

coupling C defines the coupling distance 

||C|| = max (i(pa,,qb,). (6) 

Given the space of all possible couplings between P and Q, Fpq, the discrete Frechet 
distance between P and Q is the minimum coupling distance among all couplings in 

Y' p Q ■ 

SdFiP,Q) = ||C||. (7) 

C Gi P,Q 

The continuous Frechet distance constitutes a lower bound on the discrete Frechet 
distance, Sf < 5dF, because Sf accounts for points along the (straight) edges connecting 
the vertices, whereas SdF only takes the vertices themselves into consideration [^ . 
Furthermore, if we define the maximum edge length for a polygonal curve P to be the 
largest distance between consecutive points in P, dniax(L’) = rnaxi=i^,,,^p-i d (pi,piFi), 
we can set an upper bound on SdF given two polygonal curves P and Q so that 
Sf < SdF{P,Q) < Sf(P,Q) + niax{(ii„ax(f’),dmax(Q)} (^- Thus, SdF differs from Sf 
by no more than the longest edge among both paths and, to good approximation, 

SdF ^ Sf for typical trajectories with regularly spaced conformations. Hereafter we 
refer to the discrete Frechet distance as simply the Frechet metric (distance) with 
symbol Sf for brevity. The Frechet distance is bounded from below by the Hausdorff 
distance for any given pair of piecewise-linear curves (Sf > Sh) because for convex 
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Frechet distance 



Figure 1. Two paths P (green) and Q (cyan) begin at state co and end at state c/ with 
directionality indicated by the arrows. The Frechet distance Sf and Hausdorff distance 5h are given 
by the lengths of the purple and orange lines, respectively. The purple lines are the same length and 
correspond to the minimally stretched Frechet "leash”; the orange line spans a pair of points 
separated by the Hausdorff distance (only one is shown because in this case there are infinitely many 
pairs of points with the same Sh)- Due to the backtracking of path P toward state A, combined 
with the monotonicity (no-backward-movement) constraint of the Frechet metric, Sp > Sh- 


polygonal curves the Frechet and Hausdorff distances are equal 52 while for other path 


geometries the Frechet distance can become arbitrarily larger than the Hausdorff 
distance 48 . In the case of macromolecular trajectories, the case of backtracking 


appears particularly relevant because of its conceptual link to a random walk and its 
connection to thermal fluctuations. If one path runs backward along some portion 
relative to another path, the Frechet distance will increase with the extent of the 
backtracking, whereas the Hausdorff distance will be unaffected since it ignores the 
direction of path traversal (Fig. [^. 


Measuring structural similarity. Both the Hausdorff and Frechet distances 
defined in Eq. [^and Eq. respectively, are defined in terms of a point metric d{p, q) on 
3A^-dimensional configuration space that measures the distance (i.e., similarity) between 
conformations p and q. We employ the root mean square distance (rmsd) defined in the 
usual way as 


f^RMS (p, q) = 




1 


3N 

2=1 


( 8 ) 


where N is the number of atoms, and {pi}i^i and {qi}i^i define the configuration 
space coordinates of conformations p and q, respectively. 

It should be noted that Hausdorff and Frechet metrics can be defined in terms of 
other point metrics to measure and thus emphasize different aspects of macromolecular 
structure or topology. For example, one could choose to measure the similarity of two 
protein conformers by quantifying the percentage of shared contacts. Another promising 
approach may be to integrate information-based metrics used for measuring the 
similarity of protein ensembles 53 . In this paper, we exclusively used the best-fit rmsd 


as the point metric due to its simplicity and widespread use, helping to connect with 
familiar intuitions and avoid obfuscating the examination of the path metrics 
themselves. 


Previous studies and alternative approaches. The Hausdorff metric has found 


applications in image comparison 43 , while the orientation-dependent Frechet metric 


has been used for handwriting recognition and searching handwritten documents 54 


and comparing trajectories of moving objects in geographic information systems 55 
Both metrics have also found applications in biology for protein structure 


alignment 56 57 and protein homology analysis 58 59 
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To our knowledge, the Hausdorff and Frechet metrics have not been widely used as 
general tools to quantify macromolecular pathways. However, recently two studies 
employed Frechet distances to assess convergence of transition paths to an optimal path. 


Jiang et al. 60 used the same discrete Frechet metric as used in this study to assess the 


convergence of a swarms-of-trajectories string method. Dickson et al. 61 employed a 


variation of the discrete Frechet distance where the coupling distance was defined as the 
average distance between all pairs in a coupling (instead of the maximum distance as in 
Eq.§; this discrete average Frechet distance was used in combination with an adaptive 
biasing force method to assess the convergence to an optimal path in an a priori CV 
space and was found to produce easier-to-read results by reducing statistical noise 
compared to the conventional metric. We explore this metric in more detail in |Sl Tex^ 
along with a type of average Hausdorff distance. Protein folding pathways have been 
compared quantitatively but not with Hausdorff or Frechet metrics. Several such studies 


utilized native contacts-based path (dis)similarity measures 62 -65 . In particular, both 
Graham et al. 64 and Lindorff-Larsen et al. used dissimilarity scores to assign 


individual paths to folding pathways using clustering. Different methods to sample 
conformational transitions were compared by Huang et al. [66|, who compared the 


original targeted MD (TMD) algorithm 13 with a harmonic restraint variation of TMD 


(also known as “steered MD” (SMD) or “restrained TMD” (rTMD) [^) —and biased 
MD (HMD) approaches, and Ovchinnikov and Karplus [^, who analyzed the free 
energy profiles along the transition tubes surrounding the paths produced by several 
TMD variants. 

The use of the Frechet and Hausdorff path metrics on transition paths itself is not 
new; however, their application as general-purpose tools for quantitatively analyzing 
and comparing ensembles of transition paths—and extracting the molecular-scale 
determinants that dictate their differences—is, to our knowledge, novel. A particularly 
important advantage of the Hausdorff and Frechet metrics is that they do not require a 
choice of progress variable, unlike metrics based on binning trajectory snapshots to 
compute path rmsds. While we emphasize that PSA suggests a general approach to 
quantitative transition path analysis using different structural and path metrics, we 
restricted our study to the Hausdorff and Frechet path metrics implemented with the 
rmsd (as a structural similarity metric) to demonstrate the viability of a basic approach. 
We attempted to keep the underlying principles of PSA in view to engender future 
PSA-based analyses (such as quantifying putative reaction coordinates) and stress that 
this study does not purport to exhaust all applications of PSA, nor represent an 
optimized application. Other path metrics—e.g., Frechet with speed limits. 


direction-based Frechet 69 , or Frechet with shortcuts for the analysis of noisy 


data [70| —may offer advantages in carrying out various analyses. The Hausdorff 
distance can be generalized as well to measure, for instance, distances between surfaces 
(instead of ID curves) [^. The multitudinous permutations that can be selected among 
the various path metrics, structural similarity metrics, clustering algorithms, etc. make 
PSA a flexible tool for trajectory analysis. 


Model systems 

To investigate the applicability of the Hausdorff and Frechet metrics to the problem of 
quantifying transition paths, we generated trajectories using an abstract toy system and 
we simulated conformational transitions of two globular proteins, the enzyme adenylate 
kinase (AdK) in its ligand-free form and diphtheria toxin (DT). The toy model was 
designed to gain an intuition for the path metrics and their applicability to highly 
fluctuating paths in high dimensions. AdK’s closed/open transition (Fig. §4) is a 
standard test case that captures essential features of general conformational changes in 


proteins 12 . Alongside AdK in our analysis of transition ensembles, we also examined 
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Translocation 


Receptor-binding 



Figure 2. (A) The closed open transition for adenylate kinase (AdK) involves the hinge-like motion of the LID (green) and NMP 
(cyan) domains about the relatively stable CORE (gray). (B) Diphtheria toxin (DT) can exist in two different crystallographic 
conformations that are connected by a closed <->■ open transition involving the unravelling and swinging-open of the Translocation (T or 
Middle) domain (gray), about the Catalytic (C or N-terminal; green) and the Receptor-binding (R or C-terminal; cyan) domains. 


closed —f open DT transitions (Fig.[^), which serves as a more challenging example 
due to the difficultly of capturing the putative unfolding and refolding required for 
conformational change 72 . 

AdK is divided into three domains: the ATP-binding (or “LID”) domain, residues 
122-159 in the mesophilic Escherichia coli sequence (AKeco), and the AMP-binding (or 
“NMP” or “AMPbd”) domain, residues 30-59, move relative to the CORE 
domain (73 -77 around conserved hinges (Fig. [^). The conformational change can 


occur in the ligand-free (apo) state as demonstrated in multiple experimental 
studies [TS}]^ and corroborated by computational analyses (reviewed by Seyler and 
Beckstein (l2|). Therefore, the apo AKoco enzyme is a particularly suitable model 
system for studying general conformational transitions [^. We produced transition 
paths between an open conformation of AdK [represented by chain A of PDB id 
4AKE 77 from the Protein Data Bank (PDB)], and a closed conformation (chain 
A of lAKE 83 with ligand removed). 

DT is believed to undergo a transition from an inactive closed conformation to an 
active open one, which includes a 180° rotation of a mobile domain (Fig.[^). An 
open conformation was captured in a domain-swapped dimeric structure |85| and 
compared to the closed monomeric structure 


86 . DT is divided into three domains. 


with the translocation (T) domain, residues 179-379, being responsible for the majority 
of the opening and unrolling conformational motion about the receptor-binding (R) 
domain, residues 380-535, and the catalytic (C) domain, residues 1-178. The 
conformational transition of a DT monomer was simulated previously and considered 
challenging for simulation methods 39 7^ . We simulated transition pathways of DT 
between a closed and open conformation based on chain A from the monomeric 
structure (PDB id: IMDT (8^ ) and chain A from the domain-swapped dimeric 
structure (PDB id: IDDT [85|), respectively. 


Methods 


In the following we define the PSA approach as implemented in this study (using the 
metrics described in the Introduction), and we also summarize several alternative 
approaches to analyzing transitions that we employed alongside PSA for comparison. 
We describe how a range of conformational transition paths was generated to supply a 
variety of contexts in which to test PSA. 

Molecular images were created with VMD 87 and the Bendix plugin 88 . Graphs 

and seaborn 


were plotted with the Python libraries matplotlib 89 


implementation of violin plots 91 


90 , in particular its 
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Characterizing transition paths 

Path similarity analysis (PSA). The Hausdorff metric, 6h, and the discrete 
Frechet metric, Sdp, defined in Eq. [^and Eq. respectively, were computed as 
described in the Introduction. Further details on the numerical implementation are 
provided in|S2 Text[ Both metrics are implemented as part of the MDAnalysis Python 
package [92| in the module MDAnalysis. analysis .psa, which is available as open 
source at www.mdanalysis.org under the GNU General Public License 2. 

To analyze a set of N paths, we compute the N{N — l)/2 unique pairwise Hausdorff 
and Frechet distances. To present the data efficiently, we levied the versatility of 


hierarchical clustering 93 along with the visual power of a heat map-dendrogram 


representation to present a quantitative approach to visualizing the similarities of 
collections of paths. In agglomerative hierarchical clustering, objects are linked with 
similar objects to form growing clusters in a bottom-up approach. The similarity 
between two objects is defined by a metric, while the similarity of clusters (i.e., sets of 
objects) is uniquely determined by a linkage criterion that computes cluster similarity 
as a function of the pairwise similarities of the objects comprising each cluster. 

Using the Hausdorff and Frechet metrics as similarity measures, we employed Ward’s 


method 94 in conjunction with agglomerative hierarchical clustering as implemented in 
the SciPy Python package 95 . The Ward linkage criterion specifies a minimum 


variance criterion that minimizes the total intra-cluster variance. In light of the focus of 
this paper, we restrict our study to hierarchical clustering using primarily Ward 
linkage—details regarding this restriction are provided in |S3 Te^ in the Supporting 
Information along with other relevant considerations in using cluster analysis to 
facilitate PSA. 


Native contacts analysis (NCA). For consistency with other methods used in this 
paper, we define a contact to be a residue pair whose Gq, atoms are separated by a 
distance smaller than 8 A. A native contact is a contact present in a reference structure. 
Given a transition path, the fraction of native contacts Q 96 is the fraction of contacts 


in a native structure that are present in a transition structure. We then define, for any 
intermediate conformer in a transition, Qi and Q 2 as the fractions of native contacts 
with respect to an initial and final structure, respectively. Transition paths are 
projected onto 2D Q 1 -Q 2 (NC) space by parametrically plotting the percentage of 
contacts relative to the initial and final states. 


Comparison with a linearly interpolated path. A simple way to quantify the 
geometry of a single transition path is to measure its orthogonal separation, p, from a 
reference path as a function of progress, along the reference path (Fig. [^. In this way, 
any transition path can be projected in a 2D space depicting “displacement” (p) versus 
“progress” (C) relative to a reference path. We selected naive linear interpolation (Linint) 
to serve as a zeroth-order reference transition path. Note that, in comparison with PSA, 
this approach necessitates defining an explicit progress measure in the form of a 
reference path—which may not be appropriate beyond relatively simple examples like 
the AdK transition—and is furthermore not amenable to direct pairwise comparisons 
among a large ensemble of transition paths. 

Given two boundary conformations {co,cy} S in 3A-dimensional configuration 
space with reference path R embedded in (that linearly interpolates cq and c/), 
and a piecewise-linear (transition) path P embedded in and composed of a 
sequence of conformations, {pk)'k=i^ where m is the number of time steps, we compute 
for each pk'- (1) the rmsd between pk and its orthogonal projection onto R, Vk, 

/o(fc) = dRMs(PfeUfc), (9) 
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Figure 3. A hypothetical transition pathway P (cyan line) in a 3D configuration 
space composed of a discrete number of conformer snapshots (cyan circles) 
connects an initial state (green circle), co, and final state (red diamond), c/. 
The reference path R (black line) is represented by Linint. Each snapshot pk is 
associated with its projection, Vk, on i?; the progress, ({k), is the rmsd between 
Vk and Cf (dashed purple line along R) and the displacement, p{k), is the rmsd 
between pk and Vk (dashed purple line perpendicular to R). 


and (2) the rmsd between and final state c/, 

C(/c) = dRMs(ffc, c/) (10) 

(see Fig. [^. A transition path can then be projected onto (^-p space by parametrically 
plotting C,{k) versus p{k) for all values of k. For a path beginning at rg = cg, the rmsd 
to the final structure is given by the rmsd between the initial and final states, 

C(0) = dRMs(co,c/), while the rmsd for a path ending at = c/ is 
C(m) = dRMs(c/, Cf) = 0. 

Defining p using the rmsd permits a close connection with PSA in the following way: 
the maximal rmsd of a path P from Linint, max^^{p(A:)} will be the Hausdorff 
distance between P and Linint, (5ij(P, Linint), when P is restricted to the region of 
configuration space between the boundary conformations (and assuming that structural 
alignment prior to rmsd measurement was performed identically). Furthermore, when P 
does not “backtrack”, ({k) is monotonically decreasing—indeed, P can be said to 
backtrack (with respect to some reference path) when C(fc) is not monotone—and the 
Hausdorff and Frechet distances coincide: 
max^;^{p(A:)} = (5i?(P, Linint) = 5//(P, Linint). 


Heuristic collective variables. While dimensionality reduction can be useful for 
visualizing and identifying functional protein motions, selecting the collective variables 
that span the projected space and adequately describe a conformational transition is 
nontrivial 


97,98 . Choosing heuristic coordinates for a given system often requires 


strong physical intuition, something that may absent when studying new or complicated 
transitions. In general, the determination of reaction coordinates and/or order 
parameters can be guided by quantitative methods, such as principal component 
analysis or the construction of isocommittor surfaces. In the relatively simple case of 
AdK’s closed O open transition, several viable order parameters have been used as 
low-dimensional descriptions 


12 


To explicitly illustrate the uses and limitations of heuristic collective variables, and 
to make a connection with previous work, we examine the AdK closed O open 
transition (Fig. [^) in 2D angle-angle space [^. The NMP-CORE angle 0 nmp is 
formed by the geometric centers of the backbone and Cp atoms in residues 115-125 
(CORE-LID), 90-100 (CORE), and 35-55 (NMP) of E. coli AdK. Likewise, Slid is 
defined as the angle between residues 179-185 (CORE), 115-125 (CORE-hinge-LID), 
and 125-153 (LID). As many of the methods we studied used Cc-only models, we 
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defined NMP-CORE and LID-CORE angles by exclusively using the Cq, atoms of the 
residues. The angle-angle space defined by (0 nmp, ^lid) quantifies the degree to which 
NMP and LID are open and the sequence in which they open (close) for the closed —?► 
open (open —?> closed) transition. 

Generating transition paths 

We first describe the toy model system used to supply simple transitions for testing 
purposes. We then summarize the path generation—using a variety of enhanced 
path-sampling methods—of closed —>■ open transitions of AdK and DT, which serve as 
more realistic representations of conformational transitions. 


Toy model: Double-barrel potential. To determine the extent to which the 
Hausdorff and Frechet metrics are suitable for measuring transition paths, we 
constructed a toy system to generate well-defined trajectories driven by a one-way ramp 
potential and subject to thermal noise; the resulting paths in configuration space can be 
viewed as thermally-perturbed straight lines. For our purposes, transition progress was 
measured by the center-of-mass distance of a group of particles moving along the ramp 
so that a transition was completed once the center-of-mass trajectory crossed a 
threshold value. 

The toy system is defined as a group of N particles connected by harmonic springs 
subject to Brownian dynamics in a 3D potential energy landscape (Figure]^. Individual 
particles were connected in analogy to a complete graph, with vertices and edges 
respectively representing particles and springs. Spring equilibrium distances were set to 
zero separation for simplicity. Differing dimensionalities of the configuration space were 
examined by varying the number of particles N. The external potential was given a 
double-well shape in the y-direction with a parabolic shape in the cc-direction (centered 
at X = y = 0), ensuring that particle clusters are confined to one of two “barrels” 
running along the ^-direction (Figure]^. The energy barrier between the tubes was set 
to a height of 2 ksT (~ 5 kJ/mol) at T = 300 K. We set up a ramp potential sloping 
down toward increasing 2 (i.e., a constant potential energy gradient in the positive z 
direction) to induce large-scale transitions from small to large values of z. 

To construct a properly coarse-grained system, we required zero-temperature cluster 
dynamics to be identical for all A-particle clusters (given sensibly chosen initial 
conditions). Spring constants, particle masses and sizes, and the external potentials 
were scaled so as to preserve the average diffusive behavior of a cluster. Furthermore, 
spring constants were chosen to be large enough to prevent clusters from splitting 
themselves across the central barrier (where some particles in the cluster fall to one 
tube and some fall to other). Particles comprising a cluster were furthermore initialized 
at the same location so that zero temperature center-of-mass trajectories would be 
independent of particle number, N. It should be emphasized that this toy model was 
not intended to replicate a real physical system, but primarily served to build intuition 
prior to studying conformational pathways in realistic protein systems. More detailed 
information about the construction of the double-barrel system is provided in |S4 Te^ in 
the Supporting Information. 


Simulation methods and systems. The path-sampling methods comparison was 
performed using the AdK closed —>■ open transition —between the (initial) closed 
conformation (PDB id 1AKE:A) and final (open) conformation (PDB id 4AKE:A)—as a 
testbed. We used eight methods available on publicly accessible servers [3Tj[7^ [T00Hl04| 


two in-house methods (DIMS, dynamic importance sampling [I05| , and FRODA, 
Framework Rigidity Optimized Dynamics Algorithm 39 ), and targeted MD 
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Figure 4. The toy model consists of a cluster of connected particles moving in a 
double-well potential along the z-axis under the influence of a linear ramp 
potential (not shown). In the cluster for N = 8, each particle (red) is connected 
to every other particle with a harmonic spring (blue) of equilibrium length 0 
(cluster not shown to scale.) The potential landscape for constant « forms a 
“double barrel"—red (blue) regions correspond to high (low) energies—is 
parabolic along the ^-direction (cyan line), and has a double-well shape in the 
j/-direction (purple line), which produces a central barrier separating two 
“barrels” (gray crosshatching). A saddle point is located at the intersection of 
the cyan and purple lines. Motion in this landscape is biased toward either of the 
low-energy barrels, but transitions between barrels are possible at finite 
temperatures. 


(rTMD [^) using local simulation resources (see Tablesandfor overviews). DIMS 
and FRODA were additionally used to generate example ensembles of AdK and DT 
transitions (200 transitions per method per protein, 800 total) for ensemble-based and 
Hausdorff pairs analysis. In principle, other path-sampling methods could be included 
in a comparison and, in the future, it would be worth exploring alternative methods 
such as the finite-temperature string method 


106 , weighted ensemble 


dynamics 107 108 , milestoning 109 , transition path sampling [110 , non-equilibrium 
umbrella sampling m or forward flux sampling [112| , to name a few. Key aspects of 
each method used in this study are summarized below to help connect our results with 
physical intuition about the models. Each path-sampling method is described in the 
context of the energetics they model (Table and the schemes by which paths are 
propagated or generated (Table [^. Additional details about the methods and the 
corresponding simulation settings that were used can be found in |S5 Tex^ 

Two of the tested methods are based on MD combined with perturbation techniques 
(perturbation MD) designed to drive transitions between initial and final states. Two 
MD methods— DIMS MD 15 105| (implemented in CHARMM c36b2 |113| ) and TMD 
(implemented in NAMD 2.10 |114j ) —used the all-atom CHARMM22/CMAP force 
field |115[ 116| with Langevin dynamics and implicit solvents (ACE [HZ] in CHARMM, 
Generalized Born 118 in NAMD) in the NVT ensemble at 300 K. NAMD’s TMD 
implementation uses a time-dependent harmonic restraint that moves toward a target 
conformation with constant velocity [^, instead of the original TMD approach 
introduced by Schlitter [13| that employed a stepwise holonomic constraint; in 
accordance with Ovchinnikov and Karplus 68 , we refer to the NAMD implementation 


as restrained TMD (rTMD) to distinguish it from the original algorithm. DIMS and 
rTMD transitions were driven using the heavy-atom rmsd to the target structure. The 
soft-ratcheting DIMS algorithm moves towards the target by taking trial MD steps. 
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Table 1. Modeling of energetics in tested path-sampling methods. 


Res" 

Name 

Force field/potential^ 

Solvent energetics" 

Mixing function/other energetics^ 

all-atom 

DIMS 105 

CHARMM22/CMAP 

ACS/ACE2 IS 

T = 300 K 


rTMD rer 


CHARMM22/CMAP 

Generalized Born IS 

T = 300 K 


MDdMDllOO 

bonds/angles: inf. sq-well 

Lazaridis-Karplus IS 

NBF: simple vdW/electrostatic, T = 300 K 


FRODA ife 

stereochemical constraints 

hydrophobic contacts* 

overlap/angle/H-bond constraints 


Morph |72 


CHARMM/XPLORi" 

- 

energy minimization of intermediate snapshots 


Linint 

- 

- 

- 

Cc-only 

GOdMD 101 

bonds: inf. sq-well 

- 

NBF: Go-like -F ENM-MetaD 


ANMP 102 

double-well ANM 

- 

Em\K = 


lENM 103 

double-well ANM 

- 

Em\x = E{Ui,Uf) (arbitrary), collision penalty 


MAP 31 

two AN Ms, OM dynamics 

overdamped Langevin^ 

minimum OM action — > 2 ODEs+BCs — >■ path 


MENIWSD/SP (104 

double-well ANM 

- 

Fmix = ln{exp[-d(t/i -F Ci)] -F exp(-d(t// -F £/)]} 


All MD-based methods use atomic resolution; Morph and Linint are the only other methods with greater than Cc resolution. Except for MAP, 
ENM-based models define double-well potentials using different mixing functions of each anisotropic network model (ANM) constructed about 
each native states. MAP uses 2 DDEs, found by minimizing the Onsager-Machlup action for each ANM about the native states, and satisfying 
continuity conditions for positions and velocities at their interface. MENM-SD/SP assumes weak mixing: Tm = T (/3 = 1/kTm, is an 
adjustable parameter); in the limit of vanishing mixing, Tm — > 0 '*', En,\x = mm{Ui,Uf}, which is the same double-well potential used by 
ANMP. 

^Resolution of the model. 

'^inf. sq-well, infinite square well; ANM, anisotropic network model; OM, Onsager-Machlup. 

"^IS, implicit solvent; FRODA does not have a solvent model; MAP assumes overdamped Langevin dynamics in using the Onsager-Machlup 
action. 

'*NBF, non-bonded forces; vdW, van der Waals potential; ENM-MetaD, elastic network model-based metadynamics; Smix, mixing function for 
two-state potential; Ui {Uf), potential energy function about the initial (final) native state; OM, Onsager-Machlup; ODEs-fBCs, ordinary 
differential equations plus boundary conditions. 

FRODA does not use a solvent model. 

^Morph uses CHARMM/XPLOR relaxation to minimize energy of intermediate snapshots. 

*MAP assumes overdamped Langevin dynamics in using the Onsager-Machlup action. 


Steps toward the target (decreasing rmsd-to-target) are accepted whereas backward 
steps are rejected with a finite probability; velocities are re-sampled (according to 
Maxwell-Boltzmann) until the step is accepted. rTMD moves a harmonic restraint to 
linearly decrease the rmsd-to-target. We generated three rTMD paths using a fast 
pulling speed and three with slower pulling. rTMD differs from DIMS in that explicit 
forces are introduced into the system Hamiltonian whereas DIMS effectively introduces 
an entropic force. 

Maxwell-Demon discrete Molecular Dynamics [100| (MDdMD) and GOdMD 


101 


are similar in spirit and are the only two methods based on a physical dynamical model 
among the server-based transition path generation methods. Both are based on discrete 
MD combined with soft ratcheting and a type of essential dynamics sampling. MDdMD 
utilizes an atomistic representation and an implicit solvent model; GOdMD bonds uses 
a Cq, representation and neglects solvent effects. Both methods model bonded forces 
with infinite square-wells although MDdMD incorporates further detail by using simple 
potentials to describe van der Waals and electrostatic forces; GOdMD uses a Go-like 
potential to describe non-bonded forces. Both also include an additional form of biasing 
to ensure transitions follow essential deformation movements of a protein: MDdMD 
accepts steps when the transition vector (from the current conformer to the target) 
overlaps sufficiently with an essential transition vector (defined using eigenvectors from 
NMA on a Go-like potential about the initial or final state); GOdMD uses an 
ENM-based metadynamics approach to bias the sampling of essential deformation 
modes and to ensure that trajectories escape the initial minima. 

The geometrical targeting algorithm, FRODA 39 , is an approach designed to 


produce stereochemically acceptable transition paths. FRODA moves a structure 
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Table 2. Approach to generating paths in tested path-sampling methods. 


Type 

Name 

Dynamics 

Path propagation/biasing^ 

Rev'= 

TS/Stoch" 

Progress variable 

perturbation MD 

DIMS 1105 


Langevin NVT 

SR 

N 

Y/Y 

rmsd-to-target 


rTMD 1^ 


Langevin NVT 

moving harmonic restraint 

N 

Y/Y 

rmsd-to-target 


MDdMD flOOtl 

discrete MD 

SR -F essential dynamics 

N 

Y/Y 

ssd-to-target^ 


GOdMD [1 

0T| 

discrete CG-MD 

SR + metadynamics 

N 

Y/Y 

ssd-to-target^ 

geometric targeting 

FRODA 3 

9 

- 

stepwise-enforced rmsd constraint* 

N 

Y/(Y/N) 

rmsd-to-target 

GG-ENM 

ANMP 10 

2 

- 

SD from SP (cusp min.) to minima 

Y 

N/N 

- 


iENM [103] 

- 

parametric SP/fixed-point eqn. 

Y 

N/N 

- 


MAP 31) 


- 

OM minimum action path 

Y 

N/N 

- 


MENIVbSC 

p4l| 

- 

SD from SP to minima 

Y 

N/N 

- 


MENM-SP 1104] 

- 

parametric SP/fixed-point eqn. 

Y 

N/N 

- 

adiabatic mapping 

Morph 72] 

- 

linearly interpolated snapshots 

Y 

N/N 

- 

linear interpolation 

LinInt 

- 

linearly interpolated snapshots 

Y 

N/N 

- 


DIMS, rTMD, MDdMD, and GOdMD are all non-deterministic MD-based methods. DIMS and rTMD employ a conventional force field and 
Langevin dynamics in the canonical ensemble; the discrete MD algorithms used by MDdMD and GOdMD assume ballistic particle motion until 
a collision occurs—along w/ith the depth of the interatomic square wells, momentum and energy conservation are used to determine outgoing 
momenta without explicitly computing forces. FRODA uses a non-physical dynamical algorithm to path-search stereochemically correct regions 
of configuration space. CG-ENM methods generate transitions by constructing low-energy paths in the potential energy landscape. Morph and 
Linint linearly interpolate the position of each atom between the initial and final states. 

^SR, soft ratcheting; SD, steepest descent; SP, saddle point; OM, Onsager-Machlup. 

^Is the method exactly reversible? 

■^Is the algorithm based on a (physical or non-physical) time step? Is it stochastic? 

At each step, rmsd reduced by fixed amount while simultaneously enforcing other constraints. 

^ssd, sum of squared distances to target (includes weighting that varies between MDdMD and GOdMD). 


toward a target conformation by decreasing the rmsd-to-target while enforcing 
stereochemical constraints such as bond distances and angles, backbone dihedrals, and 
contact constraints. In particular, FRODA can avoid steric clashes in an all-atom 
structure, something that may not be achieved by coarse-grained elastic network models 
(CG-ENMs) or algorithms using simple linear interpolation. 

We also generated transitions using five CG-ENM-based methods. These particular 
models first construct two harmonic potential energy functions, based on anisotropic 
network models (ANMs), about initial and final native (crystallographic) states, which 
has the general form 

17(X) = i ^ a, {d,, - + AU, 

d°j<Ro 

where the sum is taken over all unique pairs of Gq, atoms separated by less than a 
specified cutoff distance, Rc, and AU is the energy difference between the two states. 
For atoms i and j, Cij is the force constant, dij is the Euclidean distance between them, 
and d^j is the corresponding distance in the native (crystallographic) structure. Force 
constants can determined by fitting to isotropic crystallographic B-factors for instance. 
A double-well (two-state) potential landscape is constructed by combining the separate 
potentials. Given a two-state potential, transition paths are generated by connecting 
the two (end-state) minima along low-energy pathways. The ENM-based methods are 
distinguished primarily by their two-state energetics (i.e., mixing potential) and method 
of defining and searching for low-energy transition paths. The cutoff distance, can 
adjusted to some degree for all the tested ENM-based approaches, but a couple also 
enable modification of the force (spring) constants, Cij, and the end state energy 
difference, AU. 

ANMPathway [102| (ANMP) forms a double-well landscape by taking the lower of 
the individual wells; the wells intersect to form a cusp hypersurface in configuration 
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space. A path is found by locating the minimum along the cusp and performing steepest 


descent (SD) toward both well minima. The Mixed Elastic Network Model 104 


(MENM) employs a double-well function with a tunable mixing temperature whose 
purpose is to module the cusp-like intersection to provide a smooth transition in energy 
between the two wells. The method locates saddle points (SPs), and can use a steepest 
descent (SD) mode (MENM-SD) from the SPs to the minima, or it can provide a 
parametric equation describing an SP path through the fixed points of the landscape. 
The interpolated Elastic Network Model [103] (iENM) is similar to MENM-SP in that 
it analytically solves for a parametric SP path, although it only requires a general form 
of a double-well potential function (does not use an explicit mixing function). Unlike 
the other ENM-based methods, MinActionPath (MAP) does not use a mixing 
function. Instead, a path is generated by minimizing the Onsager- Machlup (OM) 
action—which assumes overdamped Langevin dynamics—with the two separate ANMs 
for the native states to derive two one-dimensional differential equations describing the 
minimum action paths in the region of each ANM. A unique transition path between 
the initial and final states is found by satisfying continuity boundary conditions in the 
positions and velocities at the interface. 

To provide a point of comparison to one of the most simple approaches, we used the 
Yale morph server 72 119| (Morph), which combines linear interpolation and optional 
energy minimization of the intermediate snapshots (i.e., adiabatic mapping), and we 
also used explicit linear interpolation to generate a single path between the end states 

(Linint). 

The main thrust of the path-sampling methods comparison is to demonstrate PSA’s 
viability and not necessarily to directly evaluate the performance of the sampling 
algorithms. As such, adjustable parameters for all simulations were left at their default 
values unless explicitly stated. Transitions were produced using the highest allowable 
resolution, i.e., using all non-hydrogen atoms when possible or only Cq, atoms otherwise. 
For each method, three unique paths were generated by either re-running those with 
stochastic algorithms or, for the deterministic ones, by adjusting a single parameter; in 
the case of rTMD, six total simulations w ere perfo rmed [three each for fast (^ 1 A/ps) 


and slow (~ 0.01 A/ps) pulling speed; see 


S5 Text 


for further details]. DIMS, FRODA 


and MDdMD simulations, which produce a unique trajectory every run, were run three 
times each without altering initial settings. Three GOdMD runs were performed by 
changing the relaxation window (20 ps, 50 ps and 100 ps). Distinct trajectories for the 
deterministic, ENM- based algorithms were obtained by varying spring cutoff distances: 
one transition at the default value and two by decreasing/increasing the cutoff. Morph 
trajectories were produced by toggling energy minimization and structural 
pre-alignment settings, and a single Linint trajectory was included as a zeroth-order 
reference. All other simulation settings were left at default values where possible. 
Simulations and analyses performed in this study are summarized in Table 
Furthermore, as half of the methods were limited to Cq, structures as inputs—the 
coarsest representation among the methods—all analyses were restricted to Cq 
trajectory representations to provide a lowest common denominator. Trajectories were 
also aligned to a common reference structure generated by aligning and averaging the 
CORE Cq coordinates of the 1AKE:A and 4AKE:A structures (see S6 Text in the 
Supporting Information for a description of the structural alignment procedures). 


Results and Discussion 

We subdivided our study in four parts to show how PSA can be used to answer a range 
of questions about macromolecular transition paths and pathways (see Table §: (1) 
The path metrics were able to distinguish and categorize simple trajectories in a toy 


PLOS 


15/40 













•@'PLOS I SUBMISSION 


Table 3. Summary of simulations, calculations, and analyses. 


Assessment 

System 

Transition 

Path generation 

path samples 

Analysis methods^ 

(1) Intuition and viability 

double-barrel 

z: 0 —>■ 4 nm 

Brownian + ramp 

4x(2 ICs) 

PSA (Sf), Sf-Sh distr/corr* 

(2) Methods comparison 

AdK 

closed —>■ open 

various methods 

3x(ll methods) 

PSA {Sf/Sh*), NCA, C-p, AA 

(3) Transition ensembles 

AdK 

closed —> open 

DIMS, FRODA 

200x(2 methods) 

PSA (5f*), 5f-5h distr/corr* 


DT 

closed —> open 

DIMS, FRODA 

200x(2 methods) 

PSA (Sf), Sf-5h distr/corr* 

(4) Atomic detail from PSA 

AdK 

closed open 

DIMS, FRODA 

200x(2 methods) 

PSA (fe-pairs) 


*Result in Supporting Information 

^Analysis methods: PSA, path similarity analysis; Sp, Frechet distance; 5h, Hausdorff distance; 3f-Sh distr/corr, Frechet-Hausdorff 
distribution/correlation analysis; NCA, native contacts analysis; (-p, progress vs. displacement along path of linear interpolation; AA, 
angle-angle space. 


system, taking thermal motion and varying number of particles into account. (2) PSA 
could be used to compare different path-sampling methods and, when combined with 
more traditional low-dimensional projections on collective variables, provide insights 
into similarities and differences between different methods. (3) PSA was able to analyze 
path ensembles, opening the door to analyzing dynamical simulations with statistical 
approaches. (4) PSA enabled us to extract the molecular structural determinants 
responsible for differences in paths, thus linking the general analysis of high-dimensional 
transition paths to the specific molecular detail. 


Path similarity analysis of toy model transitions 

We simulated one- and eight-particle cluster transitions in the double-barrel potential 
energy landscape between a starting state (defined as a center-of-mass location below 
z = 0 nm) and a final state (z > 4 nm). Eight-particle simulations at zero and 250 K 
are shown in Fig.[^ The particles were weakly confined to one of two potential energy 
barrels separated by a 2 barrier at 300 K (Fig.[^,D) and evolved under the 
influence of thermal diffusion and drift due to a linearly decreasing ramp potential in 
the z direction (Fig.[^,E). Simulations were run at temperatures between 0 K and 
600 K in 50 K increments, with eight runs at each temperature. Trajectories were 
initialized such that two distinct groups of paths would be produced at zero 
temperature: for each temperature, we initialized half of the simulations to one side of 
the central barrier at (xq, yo) = (0 nm, 0.4 nm) and the other half at (0 nm, —0.4 nm). 

At zero temperature, trajectories initiated at the same point progressed along 
identical paths due to the absence of thermal diffusion. Two trajectory groups were 
formed (Fig.[^,B), consistent with what was expected from the initial conditions. A 
clustered heat map of the Frechet distances between the T = 0 K trajectories clearly 
showed two well-defined clusters (Fig. [5p), containing four trajectories each, in both the 
structure of the dendrogram as well as the color division in the heat map. Due to 
thermal perturbations, higher-temperature trajectories exhibited substantial wandering 
(Fig.J^,E) and even produced a transition across the central barrier (blue trajectory in 
Fig. pp). In contrast with the zero temperature case, both the number of clusters and 
the clusters themselves were much more vaguely defined. Two clusters with four 
trajectories per cluster (red and green/blue trajectories. Fig. [^-F) were still formed, 
although the blue trajectory, which underwent a barrier-crossing transition near 
z = —0.5 nm, is an outlier in the cluster with the three green trajectories. 

Trajectory categorization for the toy model with PSA did not depend strongly on 
the dimensionality (cluster size) as thermal noise alone appeared to have a much more 
substantial influence (SI Figl. In particular, we could not discern meaningful differences 


in the center of mass motions between one- and eight-particle clusters from the data. 
Furthermore, in the eight-particle case at 250 K, performing PSA using the full 
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Figure 5. Double-barrel potential energy landscape projected onto the ly-plane and t/ 2 :-plane. Groups of point masses (clusters) 
mutually connected by harmonic springs move under the influence of a transition-inducing ramp potential in the positive a 
direction and the two low-energy minima of the “barrels” at t/ = ±0.8 nm. Colored lines depict the center of mass trajectories 
for each cluster. (A-C) trajectories at T = 0 K. (D-F) trajectories at T — 250 K. (A, D) Projection of paths onto the a;i/-plane 
together with the double-barrel potential. (B, E) Projection of paths onto the y«-plane. (C, E) Clustered heat maps summarize 
the Frechet distances for all pairs of trajectories; dendrograms record cluster distances according to the Ward criterion. 
Trajectory colors in each row match the corresponding path(s) in the dendrogram. The trajectory-averaged radius of gyration for 
clusters at finite temperature is 0.35 nm (black circles). 


(nm) 

1.60 

1.42 

1.24 

1.07 

0.89 

0.71 

0.53 

0.36 

0.18 

0.00 

1.89 

1.68 

1.47 

1.26 

1.05 

0.84 

0.63 

0.42 

0.21 

0.00 


(24-dimensional) configuration space trajectories did not produce a different clustering 
than PSA applied only to the center of mass trajectories (data not shown). The same 
analysis as above was carried out with the Hausdorff distance instead of the Frechet 
distance to assess their relative discriminative powers. Both produce similar results at 
temperatures below 300 K with low-temperature simulations exhibiting two distinct 
pathways (S2 Figl. Between 350 K and 500 K, however, Hausdorff and Frechet distance 
measurements started to become substantially uncorrelated (S3 Fig). This effect is 
likely due in part to the sensitivity of the Frechet metric to backtracking (Fig. [^, which 
may be amplified when the typical energy of thermal perturbations become comparable 
to the height of a potential barrier (2kBT at 300 K). High-temperature simulations 


(>300 K) began to explore both tubes as if they were a single pathway (S2 Fig and S4 


Taken together, PSA was able to distinguish groups of paths in the presence of 
stochastic thermal motions as long as the thermal energy was lower than the energy 
scale of distinguishing features in the underlying energy landscape. The dimensionality 
of the problem did not appear to be an important factor. Frechet and Hausdorff 
distances discriminated paths equally well with some small differences at high 
temperatures that likely reflect backtracking of trajectories. 
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Comparing enhanced path-sampling methods 

In order to compare a selection of fast transition path sampling methods, three distinct 
trajectories were generated for the closed —)■ open AdK transition as described in 
Methods. 


Direct comparison using PSA. A total of 37 paths (eleven methods, three paths 
per method with the exception of six paths for rTMD, plus one Linint path) were 
analyzed by computing the Frechet distance between all possible pairs and clustering of 
the resulting (symmetric) distance matrix (S5 Fig). Using the same approach as with 
the toy model, the clustered distance matrix was translated to a heat map-dendrogram 
representation (Fig. [^. 
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Figure 6. Path similarity analysis of trajectories generated by different path-sampling methods. 


The AdK closed—^-open transition was sampled three times (except Linint) with different methods 
(see text). Smaller distances indicate transition paths with greater similarity. The dendrogram 


depicts a hierarchy of clusters where smaller node heights of parent clusters indicate greater 
similarity between child clusters. Frechet distances Sp are in A and correspond to a structural rmsd 
in accordance with the rmsd point metric. See text for a description of the methods. |S5 Fig| 
contains the same data annotated with numerical values of Sp. 


Paths from a given method were more similar to other paths from the same method 
than to those produced by a different method, as indicated by well-defined 3x3 squares 
along the heat map diagonal. Methods based on similar physical models tended to 
produce relatively similar pathways, while algorithmically distinct approaches appeared 
less likely to produce similar pathways. For instance. Morph and Linint both implement 
linear coordinate interpolation. Their paths are essentially identical (Sp < 0.5 A), which 
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indicates that additional features implemented in Morph, such as checking for steric 
overlaps, may not be relevant for the AdK transition. Another cluster was formed by 
the two MD-based importance sampling methods, DIMS and MDdMD, together with 
MD-based rTMD at slow pulling velocity (“rTMD-S”; Frechet distance 
2.1 A < < 2.7 A). In other cases, similarities and differences did not always follow 

an immediately obvious pattern. FRODA, which satisfies rigidity constraints during a 
transition but does not employ a potential energy function, nevertheless formed a 
cluster with DIMS, MDdMD, and rTMD-S (2.6 A < Sp < 3.1 A). The grouping of 
FRODA with DIMS/MDdMD/rTMD-S appears, however, less strong than, for instance, 
the clustering of DIMS with MDdMD because for other choices of the linkage algorithm 
FRODA is more distantly associated with the DIMS/MDdMD/rTMD-S cluster and a 
robust cluster of MAP/Morph/LinInt trajectories (see S6 Fig B-D and further 
discussion in S3 Text I. The fast-pulling rTMD (“rTMD-F”) and MAP trajectories were 
strikingly similar to the Morph paths {Sp ~ 1 A), even though rTMD-F performs MD 
with an atomistic physics-based force field, whereas MAP’s energy function is based on 
an elastic network model and the path is generated via minimization of 
Onsager-Machlup action (and not just linear interpolation). Interestingly, the 
MAP/rTMD-F/Morph sub-cluster was grouped with the cluster formed by four of the 
dynamical algorithms (DIMS, MDdMD, rTMD-S, FRODA). The other four ENM 
algorithms—iENM, MENM-SD/SP, and ANMP—produced their own cluster, with 
MENM-SD and iENM being the most similar to each other. A careful examination of 
the heat map revealed that although MAP, rTMD-F, and Morph paths somewhat 
resembled iENM and MENM-SD paths {Sp < 2.5 A), their overall patterns of Frechet 
distances were very similar to DIMS/MDdMD/rTMD-S (as seen in the similar overall 
striping in the shading of the heat map) so that the “Morph-like cluster” rather 
clustered with these dynamical methods than with the “ENM cluster”. The GOdMD 
paths formed their own outlier cluster, appearing substantially different from all other 
methods {Sp > 3 A). 

The classification of trajectories was found to be robust against use of different 
linkage functions in the clustering algorithm, provided that the linkage primarily 
assessed the dissimilarity of clusters (such as Ward’s criterion in Fig. and the 
complete/average/weighted linkage in S6 Fig B-D) instead of similarity (single linkage 


S6 Fig A). Using the Hausdorff metric instead of the Frechet metric did not change 


the clustering either and the Pearson correlation coefficient between Sp and Sp was very 
close to unity ( |S7 Fig ). In SI Text[ alternative distance definitions, namely averaged 
Frechet and Hausdorff distances (which are, however, not proper metrics), reduced the 
amount of detail in the clustering and resulted in an amalgamation of clusters into one 
large “dynamical methods cluster” (TMD-S, DIMS, MDdMD, GOdMD, FRODA), a 
“Morph-like cluster” (Morph, Linint, TMD-S, MAP), and an “ENM cluster” (ANMP, 
iENM, MENM-SP/SD). 

Without any input except the trajectories themselves, PSA produced distinct 
clusters that appeared to broadly distinguish between dynamical and non-dynamical 
path sampling methods. With the help of more specialized analyses to be described 
next we sought to further rationalize the observed pattern of clustering. 


Native Contacts Analysis. We performed two dimensional NGA on trajectories by 
measuring (for each conformer snapshot) the fraction of native contacts relative to the 
closed starting state (Qiake) and to the open target conformation (( 54 ake) as collective 
variables (Fig. Using the NC trajectories, we examined the dynamic relationship of 
contact formation and breaking for each method. In general, the closed —>■ open 
trajectories began on or near the right vertical axis, corresponding to the first 
conformers of the paths having (nearly) 100% of their contacts in common with the 
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closed structure and around 95% of open state contacts. Most trajectories terminated at 
the top horizontal axis with the final conformers containing close to 100% of the final, 
open 4AKE:A structure contacts and about 93% of 1AKE:A contacts. The starting 
conformers of the DIMS NC paths only contained 96% of the contacts seen in the lAKE 
crystal structure (Qiake = 0.96), which is to be expected given that the initial closed 
structure was energy-minimized and equilibrated prior to performing MD. 

The five dynamical methods—DIMS, rTMD, ERODA, MDdMD, and 
GOdMD—produced somewhat noisy paths where the fluctuations took place along a 
positively sloping direction in NC space. A positive slope implies that contacts were 
simultaneously formed or broken relative to both native structures, which can be taken 
to be indicative of passage through a transition state that is distinct from either end 
state conformation. DIMS trajectories did not exactly reach the target structure 
(Q 4 ake < 0.98) as DIMS simulations were considered complete as soon as a conformer 
was within 0.5 A heavy atom (non-hydrogen) rmsd from the target crystal structure 
4AKE. MDdMD paths partly overlapped with DIMS paths during contact breaking but 
failed to reform them ((54ake < 0.94); as with DIMS, transition completion is 
determined by a cutoff—manually set to 1.5 A rmsd—due to the difficulties of 

convergence to a target using the soft-ratcheting biasing approach in MDdMD. DIMS 
and MDdMD broke a similar number of contacts relative to both states (around 8-9% 
and 9-10%, respectively). rTMD-S showed qualitatively similar behavior but broke up 
to about 12% of native contacts. The closely-knit cluster of DIMS, MDdMD and 
rTMD-S paths produced by PSA reflects the qualitative similarity of their NC 
trajectories. DIMS, MDdMD and ERODA all generated noisy, V-shaped NC pathways 
suggestive of a transition region and supports the picture from PSA where these three 
methods form a loose cluster apart from the non-dynamical methods. ERODA clustered 
somewhat apart from the other three, which correlates with the observation that 
ERODA trajectories in NC space exhibited the greatest contact breaking (Qiake = 0.82, 
Q 4 ake = 0.80) of all methods tested. This behavior is not unexpected because ERODA 
achieves random motion by randomly displacing and rotating rigid units of the protein 
at the sub-amino acid level at each step prior to re-enforcing geometric constraints. As 
such, Cq fluctuations and, thus, native contact dynamics that would be prohibited by 
conventional potentials are permitted by the geometric model although constraints on 
the overall sequence and structure would nevertheless limit dramatic perturbations to 
the Ca rmsd. COdMD paths, though quite noisy, followed a path more closely 
resembling those from the non-dynamical methods, particularly MAP and Morph. 

Morph, Linint, and two of the five ENM-based methods (ANMP and iENM) 
produced the shortest NC trajectories progressing directly to the target conformation 
with relatively little wandering, whereas the six MENM paths deviated noticeably 
toward the DIMS and ERODA trajectories in the latter half of the transition; MAP 
paths were also nearer the MENM pathways in location and shape than to the paths 
from the other ENM-based methods. The MENM paths and two MAP paths were 
unique among the non-dynamical methods in that they each contained a V-shaped, 
cusp-like feature where extra 4AKE:A contacts were broken ((54ake ~ 0.91, 0.91 and 
0.92, respectively) that were subsequently reformed toward the end of the transition. 
The rTMD-E NC path was situated in an intermediate position between the other 
dynamical methods and the non-dynamical methods. Initially, only 1AKE:A contacts 
that do not exist in 4AKE:A were broken. Then the missing 4AKE:A native contacts 
were formed. The Morph, Linint, ANMP and iENM paths, which were divided between 
two clusters in PSA, exhibited progress along negatively sloped NC space trajectories 
during which 4AKE:A contacts were formed while 1AKE:A contacts were 
simultaneously broken. However, the close structural correspondence between MAP, 
rTMD-F, and Morph paths in PSA was not recapitulated in NCA. On the other hand, 
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Figure 7. Projections of trajectory 2 of the AdK closed —open transition from each path-sampling method 
onto low-dimensional collective variables. The location of the initial structure is shown in each plot by the 
green circle, while the final structure is represented by the red diamond. (A) Projection of all pathways from 
the various path-sampling methods onto NC space. The horizontal axis corresponds to the percentage of 
contacts (of a transition snapshot) shared with the initial 1AKE:A structure (green circle) and the percentage 
of contacts in common with the final 4AKE:A structure (red diamond) is displayed on the vertical axis. The 
top-left legend identifies EN-based methods; the other methods are listed in the bottom legend. The Linint 
path is shown for reference as a broken black curve. (B) Projection on NMP angle (0nmp) vs LID angle 
(0lid)- In B and C, trajectories generated by the dynamical methods (DIMS, rTMD, FRODA, MDdMD, 
GOdMD) are plotted with diamonds and non-dynamical method trajectories with circles. (C) (-p space 
projection using Linint as the reference path. Trajectory progress in (J-p space is from left to right from 
higher to lower values of the progress variable MDdMD terminates at 1.5 A Ca rmsd from 4AKE (red 
diamond); DIMS MD terminates at 0.5 A heavy atom rmsd. 
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the ANMP paths, which were reasonably similar to iENM in PSA (1.4 A < Sp < 2.7 A 
in Fig. but fairly different from Morph (2.8 A < Sp ^ 3.1 A), appeared fairly similar 
to both iENM and Morph in NC space. 

Comparison of the NC projections of rTMD-S and rTMD-F indicated that the 
pulling velocity in rTMD directly affected the degree to which native contacts were 
broken and reformed. Consequently, different transition pathways were followed, as 
indicated by PSA, where rTMD-S clustered with the other dynamical methods and 
rTMD-F was most similar to Linint and Morph. 

NCA identified the same general division between the dynamical and non-dynamical 
methods as PSA, while some subdivisions within the dynamical/non-dynamical 
dichotomy are also borne out by both analyses, such as the closer grouping of 
MDdMD/rTMD-S/DIMS than FRODA/DIMS or FRODA/MDdMD. However, the 
cusp-like feature and overall qualitative similarity of the MENM and MAP trajectories 
in NC space that set them apart from the other non-dynamical methods is apparently 
not captured in PSA. The NC projection did not offer a clear hint as to why ANMP, 
iENM, and MENM-SD/SP were subdivided as they were in PSA and why GOdMD 
appeared as an outlier—two questions addressed by the following analysis of the 
transition paths projected onto C-p and angle-angle coordinates. 


Projections into (-p and angle-angle space. PSA and NCA are both general 
transition path analysis methods that do not require knowledge of any system-specific 
order parameters or collective variables. We employ the projection (distance from 
and progress along the path of linear interpolation) in order to resolve the remaining 
apparent discrepancies between PSA and NCA. Because good collective variables are 
known for the AdK transition 12 , we also use a 2D projection onto domain angles 99 


to connect the conclusions derived from the general analyses to visually intuitive 
structural motions of the closed —?► open transition. 

In the C-p space projection (Fig. the dynamical methods tended to obtain the 
greatest distance from the Linint reference path near the end of the transition 
(C ^ 3.5 A) whereas the non-dynamical methods peaked nearer the beginning. Thus, 
the dynamical/non-dynamical method dichotomy previously observed in both NCA and 
PSA was also present in (j-p space. The structural interpretation of this behavior is, 
based on the projection into angle-angle space (Fig. that the dynamical methods 
favored a pathway during which first the LID domains opens, followed by the NMP 
domain. Non-dynamical methods produced either NMP-opening-first paths or paths 
with brief LID-opening motions. In ^-p space, however, dynamical methods produced 
paths with a greater average and peak (orthogonal) displacement from Linint than 
non-dynamical methods (which cannot be discerned by apparent displacements in 
angle-angle space), further corroborating the clusterings from PSA. 

Fast-pulling rTMD (rTMD-F), as a dynamical method, appeared as an exception to 
the dynamical/non-dynamical method dichotomy. However, both the projection onto 
domain angles and especially the (-p projection clearly showed that the rTMD-F path 
was very similar to Linint (p < I A in Fig. [^). rTMD with very high pulling velocities 
of the restraint potential moves the system almost exlusively in the direction of the 
restraint force. For an rmsd restraint, the gradient points exactly along the Linint path. 
Therefore, rTMD-F functions more like Linint or Morph and less than equilibrium MD 
with an additional bias potential and hence PSA clustered rTMD-F with Linint and 
Morph (Fig. 1^. 

MENM-SP was the most distant member in the cluster of the four ENM-based 
methods in PSA (Fig. [^. Careful inspection of both angle-angle space (Fig. [^) and 
C-P (Fig. [^) revealed that the MENM-SP path contained a very large gap in the 
trajectory snapshots; the penultimate conformer was located in the first half of the 
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transition (^ > 4 A), while the final snapshot was the open crystal structure end state. 
Such a big gap in the path affects the discrete Hausdorff/Frechet distances because the 
distance between two MENM-SP paths with well-aligned gaps is unaffected whereas the 
distance between an MENM-SP path and one without gaps tends to be somewhat larger 
due to large point distances originating from the latter’s conformers in the portion of 
the transition where the gap occurs. ANMP was also somewhat of an outlier within the 
ENM cluster (Fig. [^, which can be traced to its path being much farther away from the 
Linint reference than any other ENM/Morph method {p « 2.2 ... 2.3 A versus 
P~ 1-3 A; Fig.[^). Structurally, the NMP domain opened nearly all the way before 
much of the LID motion took place, in contrast with every other method (Eig.[^). 

GOdMD produced the path with the greatest peak displacement {p « 2.8 A; 
Fig.[7p), corresponding to complete LID opening before substantial NMP movement 
occured (Fig. 133). The results from GOdMD are unlike any of the other methods and 
therefore GOdMD is well-classified as an outlier by PSA (Fig.[^. 

PSA was able to group fast transition path sampling methods into distinct clusters. 
These groupings could be rationalized by employing projections on more specialized 
collective variables. An important observation was that transition paths were most 
similar to other transition paths generated by the same method. This conclusion was, 
however, based on a small sample of three paths per method. We then sought to extend 
our analysis to larger ensembles of paths that would provide a statistically more 
meaningful comparison. 


Comparing DIMS and FRODA transition ensembles 


We applied PSA to transition path ensembles containing hundreds of trajectories to 
highlight several approaches to handling the statistical nature of dynamical 
path-sampling methods and illustrate the portability of our analyses to other systems. 
Ensembles of the AdK and DT closed —)■ open transitions were analyzed. DT was 
selected in part to make contact with a previous study by Farrell et al. as well as 
provide a more challenging example to demonstrate the ease with which PSA can filter 
erroneous trajectories from an ensemble. We focused on two methods, DIMS MD and 
FRODA, because they differ fundamentally in their energetic considerations yet still 
share several salient features: Heavy-atom representations were used for both methods 
for both AdK and DT. Both methods can generate path ensembles by employing a form 
of stochastic dynamics, and they drive transitions (toward a target structure) with 
similar rmsd-to-target progress variables (DIMS uses the heavy-atom rmsd-to-target for 
the soft-ratcheting coordinate; FRODA attempts to gradually decrease the Gq, rmsd to 
the target). Furthermore, our in-house implementations of DIMS MD methods allowed 
us to efficiently generate large numbers of transitions. Four unique ensembles and 800 
total trajectories were generated: 200 pathways per method per protein. Details about 
trajectory alignment for both AdK and DT is provided in |S6 Te^ of the Supporting 
Information. 

For AdK, transition path trajectories generated with DIMS formed one cluster that 
was distinct from a second cluster containing all FRODA trajectories (see S8 Fig in the 


Supporting Information). The mean Frechet distance {5p) between DIMS and FRODA 
trajectories was 2.9 ±0.1 A, significantly higher than the mean within the FRODA 
(2.2 ± O.I A) and DIMS ensemble (1.4 ± 0.2 A). DIMS generated paths with smaller 
Frechet distances among themselves than FRODA, while paths produced by a given 
method were notably more similar among themselves than when compared with paths 


from the other method, with no difference between Frechet and Hausdorff distance (SIO 


|Fig| A). These observations imply that while FRODA produced paths that sampled a 
larger region of AdK’s configuration space than DIMS, each method generated a unique 
pathway that can be viewed as a tube in configuration space whose diameter was 
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Figure 8. Clustered heat map comparing ensembles of diphtheria toxin (IMDT to IDDT) 
transition pathways produced by DIMS (blue bars) and FRODA (red bars) using the 
Frechet distance Sp. All clusterings are produced using the Ward’s criterion in ascending 
distance order; incomplete trajectories were filtered and not displayed (see text). The 
insets show that five FRODA trajectories cluster together with the DIMS ensemble. 


smaller than the typical distance between the tubes. 

While the AdK analysis was relatively straightforward, the DT heat map 
immediately revealed nine erroneous FR ODA tra jectories producing Frechet distances 
upwards of 5 A from any other path (see S9 Fig for the original clustering). Erroneous 
paths were removed by specifying a distance cutoff and re-clustering using the trimmed 
FRODA ensemble. Visual inspection of the omitted trajectories confirmed that they 
either stopped short of the target or that they came somewhat near the target but 
continued to dramatically wander in its vicinity. All the DIMS trajectories and most of 
the FRODA trajectories formed two large, separate clusters (Fig. [^. Interestingly, five 
FRODA paths were among the cluster of DIMS paths (Fig. [^insets) whereas there was 
no intermixing between DIMS and FRODA among AdK trajectories ( |S8 Fig ). 
Furthermore, the DIMS-DIMS, FRODA-FRODA, and DIMS-FRODA distributions of 
Frechet (and Hausdorff) distances for the DT transitions overlap ( SlO Fig| B), which 
indicates (via the triangle inequality of the metric) that DIMS and FRODA paths are 
situated in each others’ vicinity. 

The observation of intermixing supports the idea that the geometrical targeting 
procedure of FRODA is able to sample the space of trajectories accessible to the force 
field-based DIMS MD. As FRODA is guaranteed to produce stereochemically correct 
transitions without regards to the energetics, the set of possible paths that it can 
produce ought to be a superset of all trajectories produced by other sampling 
algorithms that, in addition to preserving stereochemistry, implement more detailed 
energetics that would further constrain the sampling space. The finding that several 
FRODA trajectories cluster with the DT DIMS ensemble hints at the possibility that 
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there exists a hierarchy of accessible path spaces that can be sampled by different 
methods depending on the energetics. 

The fact that no intermixing was observed in the AdK transitions does not 
necessarily contradict the hypothesis that FRODA trajectory space contains DIMS 
trajectory space, because the specifics of the AdK transition and the nature of the 
FRODA sampling algorithm seem to make it less likely that FRODA would actually 
sample a DIMS pathway (as discussed in more detail in S7 Text). It is not immediately 
obvious how one would tune one particular path-generating algorithm (such as FRODA) 
in order to increase its likelihood to produce a path characteristic of another algorithm 
(such as DIMS), although we already observed that the variation of the pulling speed in 
the rTMD method led to quantitatively (Fig. and qualitatively (Fig.[^ different 
paths. In particular, fast pulling (rTMD-F) generated paths similar to linear 
interpolation (Linint), whereas slow pulling (rTMD-S) paths were more similar to DIMS 
and MDdMD. Thus, although we do not yet in general understand the relationship 
between the pathways sampled by different algorithms, PSA appears to be a useful tool 
to tackle this question. 

The ultimate goal is, of course, to find a method that reliably samples transitions 
realized in the real system. The analysis presented here should also aid in identifying 
the overlap between different sampling methods and experimental data (e.g. from 
femtosecond structural biology experiments) when such data become available. 


Hausdorff pairs connect PSA to molecular detail 

PSA is a general approach that can operate on the full 3A^-dimensional trajectories 
without requiring any system-specific knowledge. It provides a very broad means to 
categorize transitions as distinct from one other. But as described so far, it is difficult 
to relate the global PSA analysis to physically relevant differences at the molecular level. 
To address this question we introduce the new concept of “Hausdorff pairs” (or “Frechet 
pairs”) that allows us to pinpoint conformations that may be more likely to exhibit 
geometric (structural) features relevant to conformational change. 

By construction, the Hausdorff and the Frechet distances identify a point-wise 
distance between two particular conformers, one on each path, as the global distance 
between the paths. The path metrics therefore induce a map between a conformer on 
one path to a conformer on another whose separation distance is, in some sense, a 
maximal deviation between the paths. We term such a pair of conformers a Hausdorff 
pair ((5//-pair) or a Frechet pair (J^T’-pair). These conformers can be examined at the 
molecular or atomic level to reveal the specific structural discrepancies that give rise to 
large deviations in configuration space between pairs of paths. 

As an explicit example, we identified three Hausdorff pairs for the DIMS and 
FRODA closed —?► open AdK transition ensembles and projected them in AA space 
(Fig. 1^). We first segregated the full set of Hausdorff distance measurements into: (1) 
mutual distances among DIMS paths, (2) mutual distances among FRODA paths, and 
(3) inter-method distances measured between a DIMS and a FRODA path. A total of 
N[N — l)/2 = 79800 i5i/-pairs were identified for the ensemble of = 400 paths. In 
order to present representative data for the whole ensemble, we identified the two 
i5i/-pairs associated with the median and maximum Hausdorff distances for each 
comparison (1), (2), and (3) as defined above. 

As a typical example we explicitly examined the median i5ij-pair identified for the 
inter-method comparisons and projected the atomic displacements onto each structure 
to locate regions of large deviation (see structures in Fig.[^). It became apparent that 
the NMP domain in the DIMS structure was closer to the LID domain because a 
number of evolutionary conserved salt bridges (D33-R156, R36-D158, D54-K157) 
persisted late into the transition due to the strong electrostatic interaction between the 
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Figure 9. “Hausdorff pairs" ((5iy-pairs) analysis using 200 DIMS (cyan) and 200 FRODA (light green) trajectories projected into 
AA space. Hausdorff distances were computed for all unique path pairs. (A) Conformer pairs—corresponding to the fc-pairs 
with the median and maximum Hausdorff distances (solid and dashed lines, respectively)—are projected onto the domain angle 
space for the following comparisons: DIMS-FRODA (purple), DIMS-DIMS (red), and FRODA-FRODA (blue). Experimental 
crystal structures, including some intermediates, are shown as stars [99[ , with further details available in SI Table Insets: Two 
heavy-atom representations are shown for the median Jn-pair between a DIMS path and FRODA path, corresponding to 
snapshots from the respective trajectories. The magnitude of the displacement vector between the two conformations is 
projected onto each atom. Color bar units for atomic displacement are in Angstrom. The initial and final conformations (green 
circle and red diamond, respectively) are shown along with the linear interpolation path Linint - black dashed line) for reference. 
(B,C) Salt bridges in the DIMS and FRODA conformers from the DIMS-FRODA median Hausdorff pair. Three LID-NMP salt 
bridges (R156-D33, D158-R36, and K157-D54) and a CORE-NMP salt bridge (E170-K57) are intact in the DIMS structure (B) 
that are broken in the FRODA structure (C). The residues responsible for these salt bridges tug on the NMP domain more 
substantially than their counterparts in the LID domain, which are located toward the base of the LID. 
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acidic and basic moieties (Fig. |^). FRODA, on the other hand, operating on purely 
geometric principles and neglecting Coulomb interactions, does not account for the 
influence of salt bridges on the transition and the associated 5//-pair structure exhibited 
broken salt bridges in the inter-LID/NMP region (Fig. [^). It is therefore not surprising 


that the FRODA trajectory did not show the “salt-bridge zipper” 99 , which manifested 


itself as discerning difference between the DIMS and FRODA trajectories. With salt 
bridges located across the NMP domain but primarily on the side of the LID, the LID is 
relatively free to move to an open configuration, whereas the NMP domain is prevented 
from fully opening until the salt bridges are broken. These considerations are consistent 
with the tendency of DIMS paths to primarily favor a LID-opening pathway (Fig. 
blue circles), while FRODA paths (Fig. green circles) sampled the region around 
Linint (Fig. black dashed line) corresponding to simultaneous LID/NMP-opening. 

A Hausdorff pair describes the two frames at which the two trajectories in question 
differ most. Additionally, the regions where trajectories differ to varying degrees from 
each other might also be of interest. This kind of information is provided by the the set 
of nearest neighbor distances along a path. Eq. defines the nearest neighbor distance 
of point pfc on path P from path Q as 6h{k; P \ Q) := Sh{pk I Q) ■= min^gg d{pk,q) and 
the nearest neighbor distance of point qk on path Q from path P as Sh{k; Q \ P). In 
general, these two distances are not symmetric, i.e. 6h{k;P \ Q) ^ 5h{j\Q \ P) for any 
conformations j,k. When 6h{k{^)]P \ Q) and 5h{j{0:Q I P) are plotted against a 
suitable common order parameter the regions of large and small differences between 
trajectories can be quantified. For example, in |Sll Fi^ the nearest neighbor distances 
of the three pairs of trajectories corresponding to the median Hausdorff pairs in Fig. 
showed that the DIMS and FRODA trajectories primarily differed in the first ^ 60% of 
the transition, which corresponds to LID-opening in DIMS and simultaneous 
LID/NMP-opening in FRODA. The DIMS trajectories differed almost uniformly along 
the whole path by only ^ 1.3 A, suggesting that they follow a similar path perturbed by 
thermal fluctuations. The FRODA trajectories differed by ~ 2 A during the middle half 
of the transition but practically coincided at beginning and end, showing that FRODA 
can accurately connect two given endpoint structures even with its stochastic 
component enabled. 

The Hausdorff-pair and nearest neighbor distance analysis naturally followed from 
the formulation of PSA. Even though only Cq, atoms were used to distinguish DIMS 
from ERODA trajectories and hence the level of detail of PSA was primarily restricted 
to conformational differences in the protein backbone, atomic-scale analysis of 
Hausdorff-pairs was able to reveal the molecular determinants responsible for the 
structural differences. 


Conclusions 

Summary. We developed a flexible and quantitative framework for analyzing 
macromolecular transition paths using path metrics as a means to measure the mutual 
similarity of paths in configuration space, potentially using the full 3A^-dimensional 
configuration space information. As far as we are aware, there is currently no standard 
procedure for quantitatively analyzing and characterizing transition paths. After 
comparing a set of transitions from a variety of path-sampling algorithms and analyzing 
transition ensembles of two sets of dynamical, stochastic trajectories, PSA’s viability as 
a tool to quantitatively compare transition paths appears promising. 

In particular, PSA demonstrated that for the AdK closed —>■ open transition, fast 
path sampling methods generated paths that were more similar to other paths produced 
by the same method than to any other paths, suggesting that among these methods 
there is currently no real consensus for what a realistic conformational transition looks 
like. Hierarchical clustering in combination with a heat map representation indicated 
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broad patterns whereby dynamical methods tended to be clustered with each other 
while non-dynamical ones (especially most of the ENM-based ones) were more similar to 
each other than to methods such as DIMS or FRODA. The clustering was qualitatively 
confirmed by a range of low-dimensional projections on collective variables, which can 
be used synergistically with PSA once additional knowledge about the system of interest 
is available. 

Analysis of ensembles of closed —>■ open transition of DT produced with the 
computationally very distinct DIMS and FRODA methods showed that in about 2.5% of 
the cases, FRODA sampled a similar region of path space as DIMS. The finding suggests 
a picture of path space where different methods preferentially generate trajectories in 
their own region even though in rare cases neighboring regions are sampled. It will be of 
interest to determine regions of this space where multiple methods overlap and evaluate 
such consensus regions as candidates for more realistic transition pathways. The 
ensemble analysis also clearly showed that at least for large scale macromolecular 
transitions, the Frechet and the Hausdorff distance are equally appropriate measures for 
path similarity, with the Hausdorff distance being cheaper to compute. 

A key advantage of PSA is its generality in that no system specific knowledge is 
required and all trajectory data can easily be used. That generality might, however, 
obscure the physical and biological detail that is important for a mechanistic 
understanding of protein function. The new concept of Hausdorff (or Frechet) pairs 
within the context of PSA suggests an effective approach to identifying the molecular 
determinants responsible for differences in paths between sampling methods, which 
could be related to variations in protein function, differences in path-sampling 
algorithms/models, or some combination of both (as demonstrated for the comparison 
between DIMS and FRODA). 


Future directions. The rmsd proved a useful choice for the point metric to measure 
structural similarity as its preponderance in the literature helps to connect path metric 
distance measurements to familiar intuitions although its interpretation is not entirely 
obvious across disparate contexts 120 . A further study would, for instance, benefit 


from a revised definition of native contacts where consecutive alpha carbons (within the 
cutoff distance) would be ignored. As such, a revised native- or self-contacts measure 
could be used as a point metric instead of rmsd. We also plan to employ k-medoids 
clustering—used to identify a “median” element (i.e., a medoid) in a set—as one 
possible approach to identifying representative transition paths, bringing us a step 
closer to identifying transition tubes or candidates for reaction coordinates. 

Furthermore, PSA can aid the assessment of enhanced path-sampling algorithms and 
their performance by quantifying the degree of similarity to gold-standard transition 
paths (for instance, equilibrium MD transitions or—when available—experimental 
time-resolved structural data). Such quantitative comparisons would be key to assess 
the physical plausibility of transitions generated by enhanced sampling methods. 
However, comparison of such transition paths to true equilibrium paths will require the 
development of a method to identify the actual transition events in the equilibrium data 
and to distinguish them from residency in (meta) stable states, ideally without 
introducing problem-specific order parameters. Possible approaches may include an 
analysis of the distribution of Hausdorff nearest neighbor distances or discrete Frechet 
couplings or techniques to match subtrajectories 


121 122 but identifying barrier 


crossing events from coordinate trajectories in a general manner alone remains an open 
problem. By extending the analysis of Hausdorff pairs we should also be able to better 
pinpoint key structural events or mutations that affect the function of biomolecules and 
so improve our understanding of the connection between protein structure, dynamics 
and function. 
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Supporting Information 

51 Text 

Alternative distance functions: average Hausdorff and discrete average 
Prechet. This text explores variations of the Hausdorff and discrete Frechet metrics 
based on averages rather than maxima, which are intended to reduce sensitivity to 
outliers. Definitions are provided and are shown, via example, to violate the triangle 
inequality. We repeat the path-sampling methods comparison and discuss their behavior 
in the context of PSA. 

52 Text 

Comments on the numerical implementation of path metrics. We informally 
discuss and comment on the numerical aspects of computing transition path similarity 
and the algorithms used to calculate Hausdorff and Frechet distances. 

53 Text 

Comments on the selection and validation of clustering algorithms. In this 
text we mention qualitative and quantitative considerations in selecting the Ward 
linkage criterion for hierarchical clustering, in the context of other linkage criteria. Some 
comments on potential data interpretation pitfalls when performing general cluster 
analyses are provided with a view toward viable approaches to PSA cluster and data 
validation. 

54 Text 

Mathematical details for the energetics and dynamics of the double-barrel 

model. Mathematical details are provided for the simulation of the double-barrel 
model. The system assumes overdamped Langevin dynamics (Brownian motion) and 
numerical integration was performed using a first-order scheme in time. The 
construction of the model permits consistent coarse-graining with respect to the number 
of particles in a cluster, effectively allowing tuning of the number of degrees of freedom, 
or the dimensionality of the configuration space. 

55 Text 

Expanded overview of the transition path generating algorithms used in 
this study. Here we provide a short review—for reader convenience—of the transition 
path generating algorithms used in the comparison of sampling methods. We summarize 
the key aspects of each of the physical models and path generating algorithms to help 
lay the groundwork for connecting algorithmic/model differences to differences between 
the respective transition paths that were produced. 

56 Text 

Details of structural alignment procedures for protein alignment prior to 
path similarity analysis. This text summarizes the considerations involved in 
structurally aligning conformer snapshots prior to running path similarity analysis on a 
set of transition paths. We provide specific details and motivations as to the alignment 
procedures used for AdK and DT trajectories. 
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S7 Text 


Discussion of FRODA trajectories as a superset of other 
trajectory-generating algorithms. In this text we qualitatively discuss the 
difference between the adenylate kinase (AdK) and diphtheria toxin transition (DT). 
These differences rationalize the observation that for the DT transition some FRODA 
trajectories overlap with the ensemble of DIMS trajectories whereas no such overlap was 
observed in the case of AdK. 


SI Fig 


Effect of temperature and dimensionality on the distribution of path 
metrics. Violin plots 


91 show the distributions of discrete Frechet distances for 


double-barrel simulations of one particle (orange) and eight particles (purple) for 
temperatures ranging between 0 and 500 K in 50 K increments (panels A-K) and at 
600 K (panel L). Black points correspond to individual Frechet distance measurements, 
with distance units in nm rmsd. A kernel density estimate (kde) is shown for each N, T 
pair to qualitatively emphasize the behaviors of the distributions across the entire 
temperature range; the bandwidth for each pair is explicitly set to produce two distinct 
distributions at low temperatures and gradually increased to generate smooth, single 
distributions at high temperatures. The separated distributions at low temperatures 
merge between 300 K to 450 K, with the eight-particle simulations merging toward 
higher temperatures relative to the one-particle simulations. 


S2 Fig 

Correlation analysis of Frechet and Hausdorff distances in the toy model. 

Regression analyses examining the correlation between corresponding Frechet 
(horizontal axes) and Hausdorff (vertical axes) distance measurements are plotted along 
with the joint distributions plots for double-barrel simulations of one particle (orange 
points) and eight particles (purple) for temperatures ranging between 0 and 500 K in 
50 K increments (panels A-K) and at 600 K (panel L). Scatter points correspond to 
individual Frechet distance measurements in nm rmsd and are plotted with the line 
produced by linear regression. The shading about the regression lines correspond to a 
95% confidence interval. Kernel density estimates (kde) are shown for each N, T pair 
and are computed using the same set of bandwidth constants specified in SI Fig The 


separated distributions at low temperatures merge between 300 K to 450 K, with a 
notable narrowing of the range of distance measurements occuring between 400 K to 
450 K. 


S3 Fig 

Effect of temperature and dimensionality on the correlation between 
Frechet and Hausdorff distance. Coefficients of the Pearson correlation between 
Hausdorff and Frechet distances for one- and eight-particle simulations plotted as a 
function of temperature. Path distances remain well correlated up to 300 K and are 
least correlated at 500 K, with the one-particle simulations exhibiting a substantially 
larger drop in correlation. At the highest temperature the central barrier becomes 
negligible and the simulations start to equally sample a single tube dominated by the 
steep repulsive walls. Therefore, the paths become more similar again between the 
N = 1 and N = 8 clusters and the correlation coefficient increases. 
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54 Fig 

Temperature-dependent transition from two to one distinct paths in the 

toy model. The means and standard deviations of the discrete Frechet (blue) and 
Hausdorff (red) distances for double-barrel simulations of one particle (A) and eight 
particles (B) are shown as functions of temperature. Measurements for simulations at 
250 K and below were divided into an upper and lower distribution by separating 
distance measurements above and below a 1.25 nm cutoff. Above the temperature 
cutoff, all measurements were treated as part of the same distribution. Both the Frechet 
and Hausdorff metric lose the ability to distinguish between the two barrels as the paths 
begin to wander out of well-defined pathways when the temperature is on the order of 
the equivalent energy of the central barrier {2kBT at 300 K). At higher temperatures, 
thermal perturbations become large relative to the barrier, permitting particle clusters 
to explore the full width of the potential spanning both barrels so as to generate 
trajectories confined to a single, unified pathway. 

55 Fig 

Annotated Frechet distance matrix of AdK transition trajectories 
generated by different path-sampling methods. The Frechet distance matrix 
from Fig. is shown with the numerical values of (rounded to one decimal) 
superimposed. Due to the size of the distance matrix, the high resolution image is 
provided as a simple means for online data exploration with the help of the zoom 
function of an image viewer. 

56 Fig 

Influence of the linkage algorithm on the clustered PSA comparison of 
different path sampling methods. Different linkage algorithms were used to cluster 
the Frechet distances produced by path-sampling methods for the AdK closed —>■ open 
transition. Smaller distances (in units of A rmsd) indicate transition paths with greater 
similarity. Dendrograms for each heat map correspond to the hierarchical clustering 
produced by the single (A), complete (B), average (C), and weighted (D) linkage 
algorithms, and depict a hierarchy of clusters with smaller node heights of parent 
clusters indicating greater similarity between child clusters. 

57 Fig 

PSA comparison of different path sampling methods based on the 
Hausdorff distance. (A) Heat map for path-sampling methods for the AdK closed —> 
open transition of Hausdorff distances produced using the Ward algorithm. Clusters are 
identical to the Ward clustering for Frechet distances in Fig. (B) Correlation and 
joint distributions between discrete Frechet versus Hausdorff distance measurements (in 
A rmsd) for the AdK closed —>■ open methods comparison. Strong linear correlation 
indicated by the scatter plot, with a Pearson correlation coefficient very close to unity, 
indicates that either metric could have been used to perform the path-sampling 
methods analysis with essentially identical results. A slight deviation of the scatter 
points below the line of unity slope is consistent with the fact that Frechet distances are 
bounded from below by corresponding Hausdorff distances. 

58 Fig 

Clustered PSA heat map of AdK transition ensembles. Clustered heat map 
comparing path ensembles of adenylate kinase (1AKE:A to 4AKE:A) transition paths 
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produced by DIMS (red bars) and FRODA (blue bars) using the discrete Frechet 
distance 6f- Clustering was produced using the Ward algorithm in ascending distance 
order. 


S9 Fig 

Clustered PSA heat map of raw DT transition ensembles. Clustered heat map 
comparing the raw path ensembles of diphtheria toxin (1MDT:A to 1DDT:A) transition 
paths produced by DIMS (red bars) and FRODA (blue bars) using the discrete Frechet 
distance Sp- Clustering was produced using the Ward algorithm in ascending distance 
order. Nine erroneous FRODA paths (orange cluster) were very distant from all other 
paths—all nine were removed from the ensemble and to produce the heat map 
dendrogram in Fig. 

510 Fig 

Correlation between Frechet and Hausdorff distance in ensemble 
comparisons. Correlations and joint distributions of discrete Frechet versus Hausdorff 
distance measurements (in A rmsd) of the AdK (A) and DT (B) ensemble analyses are 
shown. Measurements are divided into three separate distributions: (1) mutual 
distances among DIMS paths (red), (2) mutual distances among FRODA paths (green), 
and (3) inter-method distances measured between a DIMS and a FRODA path (blue). 
Both scatter plots show strong correlation between the path metrics for all the 
distributions, with Pearson correlation coefficients equal to unity and p-values equal to 
zero to two decimal places, indicating that either metric could have been used to 
perform the path-sampling methods analysis to obtain essentially identical results. A 
slight deviation of the scatter points below the line of unity slope is consistent with the 
fact that Frechet distances are bounded from below by corresponding Hausdorff 
distances. DIMS simulations exhibited less variation than FRODA in the AdK 
transition, but had a larger average variation in the DT transition. In both cases, 
inter-method DIMS-FRODA comparisons were substantially larger than comparisons of 
pairs of paths from a single method. 

511 Fig 

Nearest neighbor distances along trajectories for the median Hausdorff 
pairs in the AdK ensemble comparison. The nearest neighbor distances 

Sh{k;Q I P) (solid line,-) and Sh{k-,P \ Q) (dashed line,-) between pairs of 

paths P/Q belonging to the three median Hausdorff pairs in the AdK ensemble 
comparison (Fig. 1^) are shown for DIMS/FRODA (purple), DIMS/DIMS (blue), and 
FRODA/FRODA (green). The largest value maxfcj \5h{k\Q \ P),Sh{j;P \ Q)) is the 
actual Hausdorff distance. For illustration purposes, nearest neighbor distances are 
plotted as a function of frame number k normalized to the interval [0,1] (i.e., k/\P\), 
where 0 (1) corresponds to the first (last) frame. In general, an appropriate 
one-dimensional order parameter should be chosen in order to plot nearest neighbor 
distances for structurally corresponding trajectory frames. 


SI Table 


Numerical values for domain angles for experimental crystal structures of 

AdK. The NMP-CORE angle and LID-CORE angle for E. coli EcoAdK or models of 
EcoAdK based on crystal structures of homologous proteins 99 was computed from the 
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Cq atoms in the NMP, LID and CORE domain as described in Methods. The angles 
are plotted in Fig.[^. 
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